Abstract

Record deduplication is the task of merging database records that refer
to the same underlying entity. In relational databases, accurate deduplication
for records of one type is often dependent on the merge decisions
made for records of other types. Whereas nearly all previous approaches
have merged records of different types independently, this work models
these inter-dependencies explicitly to collectively deduplicate records of
multiple types. We construct a conditional random field model of deduplication
that captures these relational dependencies, and then employ a
novel relational partitioning algorithm to jointly deduplicate records.

We evaluate the system on two citation matching datasets, for which
we deduplicate both papers and venues. We show that by collectively
deduplicating paper and venue records, we obtain up to a 30% error reduction
in venue deduplication, and up to a 20% error reduction in paper
deduplication over competing methods.


1  Introduction 

A  common  prerequisite  for  knowledge  discovery  is  accurately  combining  data 
from  multiple,  heterogeneous  sources  into  a  unified,  mineable  database.  An 
important  step  in  creating  such  a  database  is  record  deduplication:  consolidating 
multiple  records  that  refer  to  the  same  abstract  entity.  The  difficulty  in  this 
task  arises  both  from  data  errors  (e.g.  misspellings  and  missing  fields)  and  from 
variants  in  field  values  (e.g.  abbreviations). 

Most  historical  approaches  have  framed  the  deduplication  problem  as  a  set  of 
independent  decisions.  For  each  pair  of  records,  a  similarity  score  is  calculated, 
and  the  records  are  merged  if  the  similarity  is  above  some  threshold  [8].  The 
decisions  are  combined  by  taking  the  transitive  closure  of  the  resulting  adjacency 
matrix. 
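
The classical pipeline described above can be sketched in a few lines. This is a minimal illustration and not the implementation used in [8]: the function names, the token-overlap similarity, and the threshold values are assumptions chosen only to make the example self-contained. Pairs scoring above the threshold are linked, and a union-find structure computes the transitive closure.

```python
from itertools import combinations


def transitive_closure_clusters(records, similarity, threshold):
    """Link every pair scoring above `threshold`, then take the transitive closure."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent pointers to the cluster representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Independent pairwise decisions: each pair is judged in isolation.
    for i, j in combinations(range(len(records)), 2):
        if similarity(records[i], records[j]) > threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(records[i])
    return sorted(clusters.values())


def token_jaccard(a, b):
    """Toy similarity (an assumption): Jaccard overlap of lowercased tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)


venues = ["CIKM", "cikm", "Conf on Information Management",
          "Conference on Information Management", "SIGIR"]
clusters = transitive_closure_clusters(venues, token_jaccard, 0.5)
```

Note that this procedure can never link "CIKM" to its spelled-out form, since no single pairwise score supports that merge; this is the kind of decision that relational evidence is meant to recover.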

More recently, McCallum and Wellner [13] and Parag and Domingos [20]
have demonstrated that making multiple deduplication decisions collectively
can provide better results than historical approaches. These models are types
of conditional random fields (CRFs) [9], where the observed nodes are mentions,
and the predicted nodes are the deduplication decisions for each pair of nodes.
By framing inference as an instance of graph partitioning, the models are
"collective" in the sense that mentions are clustered based not only on their distance
to each other, but also on their distance from all other partitions. By treating
deduplication decisions in dependent relation to each other, inconsistencies and
noise in the similarity metric may be overcome.

This paper presents a model for collective deduplication, extended to the
important and ubiquitous case of relational databases, where records have types,
and where there exist relations between records of different types. These
relations provide useful evidence for deduplication decisions because the identity of
a record often depends on the identities of related records.

For example, consider a database of research papers, where records can be
of type paper, venue, or author. If two paper records are labeled as duplicates,
then it follows that the venue records corresponding to those papers should also
be labeled as duplicates. The reverse is more subtly true: if two venues are
duplicates, then this may slightly increase the probability that their corresponding
papers are duplicates.

We  propose  a  model  that  leverages  these  subtle  interdependencies  to  make 
deduplication  decisions  collectively  across  multiple  record  types. 

In particular, we present a CRF for the citation domain that provides a
conditional probabilistic model of deduplication decisions over records of multiple
types given observed record mentions and the relations among them. We
propose a novel, relational graph partitioning algorithm for inference that not only
ensures that deduplication decisions made for different record types are
consistent, but also allows the decisions from one record type to inform the decisions
for another record type.

Parameter estimation consists of maximizing the product of local marginals
for pairs of records of different types. That is, we parameterize the CRF to
learn weights over 4-tuples consisting of a record pair and a related record pair
of a different type. In the citation domain, these 4-tuples consist of a pair of
paper records and a pair of related venue records. In this way, the model learns
parameters to trade off paper and venue deduplication decisions.

We provide results on a database of research papers, where we show that
modeling deduplication of paper and venue records collectively improves
deduplication performance for each type, providing up to a 30% error reduction in
venue deduplication, and up to a 20% error reduction in paper deduplication
over a previously proposed collective model [13] that does not model the
dependencies between record types.

2  Related  Work 

To the best of our knowledge, this is the first paper to present a discriminative,
collective model of deduplication for multiple, related record types and
demonstrate empirically the performance gains attainable over independent models.
We briefly review classical work in deduplication, then discuss recent efforts in
collective deduplication.

Record deduplication, known variously as record linkage, coreference
resolution, deduplication, and identity uncertainty, is prevalent in many fields,
including computer vision, databases, and natural language processing.

Originally introduced in the database community as "record linkage" [17],
record deduplication was later formalized by Fellegi and Sunter [8] as the
computation over features between pairs of records, and further extended by Winkler
[25, 26]. This previous work calculates a similarity score for record pairs,
collapses those above a similarity threshold, then performs transitive closure. It
is not relational in the sense that one deduplication decision does not directly
affect another.

More recent record linkage work has considered the deduplication of
categorical data, allowing attributes to be deduplicated along with records [1]; however,
that work does not utilize machine learning and requires thresholds to be set
manually.

Methods  of  learning  a  better  similarity  score  have  been  investigated  recently 
in  the  database  community  [5,  7].  Similar  trends  exist  in  natural  language 
processing  for  the  task  of  coreference  resolution,  where  research  has  focused 
on  learning  more  useful  similarity  metrics  and  applying  them  to  thresholding 
techniques  analogous  to  those  found  in  the  database  community  [18,  16]. 

Only recently have collective deduplication models been investigated. Milch
et al. have introduced generative models to reason in worlds with an unknown
number of objects, enabling probability distributions to be defined over
relational data with many object types [11, 15]. While these models have appealing
formal semantics, their generative nature forces the model to make conditional
independence assumptions among features.

Our work can be viewed as an extension of recent models that apply conditional
random fields to the deduplication task [13, 24, 20]. McCallum and Wellner [13]
present CRFs that perform collective coreference by casting inference
in the CRF as a graph partitioning problem, resulting in collective coreference of
records of one type. This previous work demonstrated the advantages collective
coreference has over classical approaches; however, it does not model multiple
types of coreferent objects.

Parag and Domingos [20] present a CRF model similar to that in McCallum
and Wellner [13] in that it collectively deduplicates records of the same
type using graph partitioning. Additionally, this method allows information
to propagate between records by way of their shared attributes. However, the
Parag and Domingos model does not treat attributes as first-class objects. In
particular, their model collapses string-identical attribute nodes and creates
"information nodes" to model whether or not attributes match. The model does
not explicitly optimize deduplication decisions for attributes; rather, the
"information nodes" can be viewed as input variables to record deduplication. An
important distinction with our work is that in the Parag and Domingos model,
joint deduplication only occurs among records sharing an identical attribute.
This is often not the case in real data.

In a sense, the Parag and Domingos model can be viewed as a discriminative
version of a recently proposed hierarchical model for deduplication by Ravikumar
and Cohen [22], which introduces latent match nodes for attributes. Here
again, determining whether attribute values are coreferent is viewed as a local
decision used as input to the record deduplication decision. Our model instead
treats attributes as records themselves, performing full deduplication on them
as well as their related records.

PID  Author     Title                        Venue                           VID
0    X. Li      Predicting the stock market  CIKM                            10
1    X. Li      Predicting the stock market  Conf on Information Management  20
2    J. Smith   Semi-Definite Programming    CIKM                            30
3    Smith, J.  Semi-Definate Programing     Conference on Info Management   40

Table 1: An example of four papers with the same venues in a publications
database. PID is the paper id, and VID is the venue id.


Other recent work in knowledge discovery has leveraged relational
information to perform coreference [3, 4]. These models define a similarity metric
between records that considers the identity of related records. This is similar in
spirit to our model, since it uses deduplication decisions of related records to
calculate the similarity between records. However, their model is mainly concerned
with deduplicating authors alone, and does not explicitly model deduplication
of multiple record types. Also, the training methods described in those models
do not capture the rich set of features available to the model presented in this
paper.

Our model can also be described as a type of relational Markov network
(RMN) [23]. RMNs have been employed successfully in relational domains,
although not for multi-type deduplication. Another important distinction is that
our training and inference methods differ substantially from the loopy belief
propagation algorithm commonly used in RMNs.


3  Motivating  Example 

We  first  provide  an  example  to  motivate  the  potential  benefits  of  collective 
deduplication  for  databases  with  multiple  record  types. 

Consider  again  a  database  of  research  papers,  with  author,  paper,  and  venue 
records.  Our  task  is  to  deduplicate  the  various  mentions  of  these  records  into 
unique  entities.  Table  1  shows  a  database  of  four  papers  and  four  venues,  each 
with  unique  ids.  Papers  0  and  1  and  papers  2  and  3  should  be  merged;  all  the 
venues  should  be  merged. 

Imagine  an  agglomerative  deduplication  system  which  begins  by  assuming 
each  record  is  unique.  Suppose  the  system  first  considers  merging  papers  0  and 
1.  Although  the  venues  do  not  match,  all  the  other  fields  are  exact  matches, 
so  it  is  feasible  that  the  system  may  overcome  this  discrepancy.  After  merging 
papers  0  and  1,  the  system  also  merges  the  corresponding  venues  10  and  20 
into  the  same  cluster,  since  the  venues  of  duplicate  papers  must  themselves  be 
duplicates. 

Imagine the system next merges venues 10 and 30 because they are string
identical. The system must now decide if papers 2 and 3 are duplicates. Treated
in isolation, a system may have a hard time correctly detecting that 2 and
3 are duplicates: the authors are highly similar, but the title contains two
misspellings, and the venues are extremely dissimilar.

Figure 1: Three increasingly complex models of paper and venue deduplication,
shown in panels (a), (b), and (c). $X^a$ and $X^b$ are paper and venue records,
respectively. $Y$ is a binary random variable indicating duplicate records, and
$R^{ab}_{ij}$ denotes whether paper $X^a_i$ was published in venue $X^b_j$.

However,  the  system  we  have  described  so  far  is  fortunate  to  have  more 
information  at  its  disposal.  It  has  already  merged  venues  10  and  20,  which  are 
highly  similar  to  venues  30  and  40.  By  consulting  its  database  of  deduplicated 
venues,  it  could  determine  that  30  and  40  are  in  fact  the  same  venue.  With  this 
information  in  hand,  it  may  be  more  forgiving  of  the  spelling  mistakes  in  the 
title,  finally  merging  papers  2  and  3  correctly. 

This  example  illustrates  the  notion  that  the  identity  of  an  object  is  dependent 
on  the  identity  of  related  objects.  Notice  that  by  using  relational  information, 
the  system  not  only  merged  papers  it  might  not  have  otherwise  (2  and  3),  but 
also  merged  venues  it  might  not  have  (10,  20  and  30,  40).  Indeed,  the  chain  of 
deduplication  decisions  which  led  to  optimal  performance  interleaved  paper  and 
venue  decisions:  (0,1),  (10,20),  (30,40),  (2,3). 

The work presented here describes a system that models the deduplication
decisions of related records collectively, enabling the sort of probabilistic
trade-offs instrumental to the success of the system in this example.

4  Model 

The  model  is  an  instance  of  a  conditional  random  field  that  jointly  models  the 
conditional  probability  of  multiple  deduplication  decisions  given  an  observed 
relational  database. 

We  begin  with  a  brief  review  of  conditional  random  fields,  followed  by  a 
formal  description  of  the  model.  We  then  describe  the  approximations  used  to 
make  inference  and  parameter  estimation  tractable  for  this  model. 
4.1 Conditional Random Fields

Conditional random fields (CRFs) [9] are undirected graphical models encoding
the conditional probability of a set of output variables $Y$ given a set of evidence
variables $X$. The set of distributions expressible by a CRF is specified by an
undirected graph $\mathcal{G}$, where each vertex corresponds to a random variable. If
$C = \{\{y_c, x_c\}\}$ is the set of cliques in $\mathcal{G}$, then the conditional probability of $y$
given $x$ is

$$ p_\Lambda(y \mid x) = \frac{1}{Z_x} \prod_{c \in C} \phi_c(y_c, x_c; \Lambda) $$

where $\phi_c$ is a potential function parameterized by $\Lambda$ and
$Z_x = \sum_{y} \prod_{c \in C} \phi_c(y_c, x_c; \Lambda)$
is a normalization factor. We assume $\phi_c$ factorizes as a log-linear combination
of arbitrary features computed over clique $c$, therefore

$$ \phi_c(y_c, x_c; \Lambda) = \exp\Big( \sum_k \lambda_k f_k(y_c, x_c) \Big) $$

The model parameters $\Lambda = \{\lambda_k\}$ are a set of real-valued weights typically
learned from labeled training data by maximum likelihood estimation.
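
To make the log-linear form concrete, the following sketch evaluates $p_\Lambda(y \mid x)$ by brute-force enumeration on a toy problem with three pairwise merge decisions. It is only an illustration: the feature function `toy_features`, the similarity values, and the weights are assumptions, and enumeration is feasible only because the example is tiny.

```python
import math
from itertools import product


def conditional_distribution(x, n_outputs, features, weights):
    """Return {y: p(y | x)} for binary outputs y by brute-force enumeration."""
    scores = {}
    for y in product((0, 1), repeat=n_outputs):
        # Unnormalized score: exp(sum_k lambda_k * f_k(y, x)).
        scores[y] = math.exp(sum(w * f for w, f in zip(weights, features(y, x))))
    z = sum(scores.values())  # normalization factor Z_x
    return {y: s / z for y, s in scores.items()}


def toy_features(y, x):
    """Hypothetical features for merge decisions over records 0, 1, 2.

    x holds similarities for the pairs (0,1), (1,2), (0,2);
    y holds the corresponding binary merge decisions.
    """
    agreement = sum(yi * (xi - 0.5) for yi, xi in zip(y, x))
    # Exactly two merges among three records is a transitivity violation.
    violates_transitivity = 1.0 if sum(y) == 2 else 0.0
    return [agreement, violates_transitivity]


dist = conditional_distribution([0.9, 0.8, 0.4], 3, toy_features, [4.0, -10.0])
best = max(dist, key=dist.get)
```

With these assumed values the most probable assignment merges all three records, even though the pair (0,2) has similarity below 0.5 on its own, because the alternative configurations either violate transitivity or give up a confident merge.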

4.2  CRFs  for  multi-type  deduplication 

Let $X$ be a collection of random variables representing observed record mentions
in a database that requires deduplication. For clarity, assume there are only two
types of records, $X = (X^a, X^b)$, where $X^a = (X^a_1, \ldots, X^a_n)$ and $X^b = (X^b_1, \ldots, X^b_m)$.
The goal of deduplication is to partition $X$ into clusters of records that all refer
to the same abstract entity.

To this end, we define a collection of binary random variables $Y = (Y^a, Y^b)$
that indicate whether or not two records are duplicates. For example, $Y^a_{ij}$
indicates whether or not records $X^a_i$ and $X^a_j$ are coreferent. We also define
the binary random variables $R$, where $R^{ab}_{ij}$ indicates whether some arbitrary
relation $R$ holds between record mentions $X^a_i$ and $X^b_j$.

For example, in a research paper database, $X^a$ represents the set of paper
records, $X^b$ represents the set of venue records, $Y^a_{ij}$ indicates whether $X^a_i$ and $X^a_j$
are duplicates, and $R^{ab}_{ij}$ indicates whether paper $X^a_i$ was published at venue $X^b_j$.

In the general case where $R$ is unobserved, one could construct the
conditional distribution $P(Y^a, Y^b, R \mid X)$. With this model, one can infer from an
observed set of records the most probable set of duplicate records and the most
probable set of relations between records. For example, in the publications
domain, one may want to model the advisor_of relation between authors, while
also modeling author deduplication.

We postpone this investigation for future work, and instead focus on the
case where $R$ is observed. For instance, in citation data, we know which venue
records are related to which paper records. Thus, we desire to model the
conditional distribution $P(Y^a, Y^b \mid X, R)$.
Figure 1 displays three increasingly complex graphical models of this
conditional distribution. Vertices are random variables, edges indicate a possible
probabilistic dependence between variables, and shaded vertices indicate
observed variables. Model (a) corresponds to the classical approach, which treats
each deduplication decision independently. Model (b) is the approach evaluated in
McCallum and Wellner [13], extended here to the case of multiple record types.
Note that in this model deduplication decisions for records of the same type are
made collectively.

Model (c) is the one this paper advocates. Not only are deduplication
decisions for records of the same type made collectively, but also the decisions for
one type of record are dependent on decisions made for related records. For ease
of presentation, we have only included the observed relation variables $R$ which
are true.

We now provide a more precise description of model (c).

Let $x^{ab}_{ij} = (x^a_i, x^a_j, x^b_i, x^b_j)$ be a pair of observed paper record mentions and
their corresponding venue records. To capture the dependence between $y^a_{ij}$ and
$y^b_{ij}$, we factorize the potential functions to consider them jointly, resulting in
the model:

$$ p(y^a, y^b \mid x, r) = \frac{1}{Z_x} \exp\Big( \sum_{i,j,l} \lambda_l f_l(x^{ab}_{ij}, y^a_{ij}, y^b_{ij}, r^{ab}_{ij}) + \sum_{i,j,k} \lambda_* f_*(y^a_{ij}, y^a_{jk}, y^a_{ik}, y^b_{ij}, y^b_{jk}, y^b_{ik}) \Big) $$

where the features $f_*$ are consistency checking functions used to enforce
transitivity among deduplication decisions. For example, if papers $x^a_i$ and $x^a_j$
are coreferent, and $x^a_j$ and $x^a_k$ are coreferent, then not only must papers $x^a_i$ and
$x^a_k$ be coreferent, but venues $x^b_i, x^b_j, x^b_k$ must also be coreferent. (Note that $f_*$ is of
notational use only; in practice, the inference algorithm simply avoids these
impossible configurations.)

Because both $y^a_{ij}$ and $y^b_{ij}$ are arguments to the feature functions $f_l$, these
potentials capture the cross-product of paper and venue deduplication
decisions. This allows the learned weights to encourage merging paper records
which have equivalent venue records, and to discourage merging papers with
different venues.

The cost of modeling these interdependencies is a highly connected
graphical model, which necessitates approximations in both inference and parameter
estimation. We describe these approximations below.


4.3  Inference 

Inference in this model corresponds to finding the solution to

$$ y^* = (y^{a*}, y^{b*}) = \arg\max_{y} \; p_\Lambda(y^a, y^b \mid x^a, x^b, r), $$

that is, finding the most probable deduplication decisions $y^*$ given $x^a$, $x^b$, $r$ and
the learned parameters $\Lambda$.

Exact inference in this model is intractable because the space of possible $y$
is exponential in the number of records $x$, and the high connectivity of the
graph precludes a feasible dynamic program to make this search tractable.

One common approximate inference technique for such a predicament is to
perform loopy belief propagation; that is, perform standard belief propagation
[21], ignoring the "message double-counting" caused by the cycles in the graph.
However, the severe cyclicity of this model may require a prohibitive amount of
time for belief propagation to converge, if it converges at all.

Instead,  we  follow  recent  work  which  finds  an  equivalence  between  graph 
partitioning  algorithms  and  inference  in  certain  undirected  graphical  models 
[6].  We  first  transform  our  graph  to  a  weighted,  undirected  graph  that  only 
contains  vertices  for  variables  x  and  has  edges  weighted  by  the  (log)  clique 
potential  for  each  pair  of  vertices.  The  value  on  these  edges  depends  on  which 
type  of  records  they  join. 

For paper edges, we define the weight

$$ w^a_{ij} = \sum_{y^b_{ij} \in \{0,1\}} \Big( \sum_l \lambda_l f_l(x^{ab}_{ij}, y^a_{ij} = 1, y^b_{ij}, r^{ab}_{ij}) - \sum_l \lambda_l f_l(x^{ab}_{ij}, y^a_{ij} = 0, y^b_{ij}, r^{ab}_{ij}) \Big) $$

and similarly for venue edges:

$$ w^b_{ij} = \sum_{y^a_{ij} \in \{0,1\}} \Big( \sum_l \lambda_l f_l(x^{ab}_{ij}, y^a_{ij}, y^b_{ij} = 1, r^{ab}_{ij}) - \sum_l \lambda_l f_l(x^{ab}_{ij}, y^a_{ij}, y^b_{ij} = 0, r^{ab}_{ij}) \Big) $$

Intuitively, the paper weights $w^a_{ij}$ can be thought of as the compatibility of
papers $x^a_i, x^a_j$, summed over possible deduplication decisions for venues $x^b_i, x^b_j$.
Similarly, the venue weights $w^b_{ij}$ can be thought of as the compatibility of venues
$x^b_i, x^b_j$, summed over the possible deduplication decisions for papers $x^a_i, x^a_j$.
Interpreting the weights as the similarity between two records, we can see that
the similarity of paper records considers the similarity of their venue records,
and vice versa.
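
As a worked illustration of the two edge weights, the sketch below computes $w^a_{ij}$ and $w^b_{ij}$ for a single paper pair and its venue pair. The feature function `pair_features`, the similarity values, and the weights $\lambda_l$ are hypothetical; only the summation structure follows the definitions above.

```python
def _score(features, weights, x_ab, y_a, y_b, r_ab):
    """sum_l lambda_l * f_l(x_ab, y_a, y_b, r_ab)."""
    return sum(w * f for w, f in zip(weights, features(x_ab, y_a, y_b, r_ab)))


def paper_edge_weight(features, weights, x_ab, r_ab):
    """w^a_ij: score(y_a=1) - score(y_a=0), summed over venue decisions y_b."""
    return sum(
        _score(features, weights, x_ab, 1, y_b, r_ab)
        - _score(features, weights, x_ab, 0, y_b, r_ab)
        for y_b in (0, 1)
    )


def venue_edge_weight(features, weights, x_ab, r_ab):
    """w^b_ij: score(y_b=1) - score(y_b=0), summed over paper decisions y_a."""
    return sum(
        _score(features, weights, x_ab, y_a, 1, r_ab)
        - _score(features, weights, x_ab, y_a, 0, r_ab)
        for y_a in (0, 1)
    )


def pair_features(x_ab, y_a, y_b, r_ab):
    """Hypothetical features; x_ab is (title_similarity, venue_similarity)."""
    title_sim, venue_sim = x_ab
    return [
        y_a * (title_sim - 0.5),   # evidence for merging the papers
        y_b * (venue_sim - 0.5),   # evidence for merging the venues
        y_a * y_b * r_ab,          # joint feature: both merged and related
    ]


lambdas = [2.0, 2.0, 1.0]          # assumed weights
w_a = paper_edge_weight(pair_features, lambdas, (0.75, 0.25), 1)
w_b = venue_edge_weight(pair_features, lambdas, (0.75, 0.25), 1)
```

Under these assumed values the joint feature raises the venue weight from $-1.0$ (what the venue similarity alone would give) to $0.0$, which shows how evidence about the papers enters the venue edge.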

This results in a weighted, undirected graph with edge weights ranging from
$-\infty$ to $+\infty$. It can be shown that finding an optimal partitioning of this graph
corresponds to finding the optimal configuration $y^*$ in the original undirected
graphical model. Here, the number of partitions is unknown, as it corresponds
to the number of unique records.


Although graph partitioning with positive and negative edge weights is NP-hard,
there exist several good approximations, including recent work in
correlation clustering [2]. Additionally, McCallum and Wellner [13] have found
that greedy agglomerative clustering with an average link criterion works well
in practice.

However, traditional partitioning algorithms would not account for the known
dependencies between clusters that exist in our data. Therefore, we develop a
novel, relational agglomerative clustering algorithm that exploits these
dependencies.

Traditional  greedy  agglomerative  clustering  first  initializes  each  vertex  to 
its  own  cluster,  then  iteratively  merges  the  clusters  that  are  “closest,”  where 
the  distance  between  clusters  is  often  defined  as  the  average  of  the  edge  weights 
connecting  the  two  clusters.  We  augment  this  algorithm  with  two  enhancements. 

First, we must enforce the constraint that duplicate papers have duplicate
venues. This is straightforwardly enforced by the following rule: whenever a
pair of paper clusters is merged, their corresponding venue clusters must also
be merged.

The second enhancement redefines the distance between clusters to more
accurately reflect the impact of the first enhancement. Let $C^a_i, C^a_j$ be two paper
clusters that are candidates to be merged, and let $C^b_i, C^b_j$ be the venue
clusters corresponding to these papers. The first enhancement requires that if we
merge $C^a_i, C^a_j$, we must also merge $C^b_i, C^b_j$. However, the current distance metric
between $C^a_i, C^a_j$ does not reflect this fact.

To remedy this, we redefine the distance between two paper clusters $(C^a_i, C^a_j)$
to be the average of (1) the traditional distance between the paper clusters
$(C^a_i, C^a_j)$ and (2) the traditional distance between their corresponding venue
clusters $(C^b_i, C^b_j)$. (Note that we choose the average rather than the sum to
deal with papers that have no venue information.) This metric is likely to
better approximate the effect merging $C^a_i, C^a_j$ will have on the objective function
$p_\Lambda(y \mid x, r)$, since it accounts for the merger of the corresponding venue clusters.
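
The two enhancements can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes each paper carries its own venue mention (as in Table 1), treats edge weights as similarities so that the "closest" pair has the highest score, lets venue clusters also merge on their own, stops when no candidate scores above `stop`, and scores venues already in one cluster by their within-cluster average. All names and the example weights are hypothetical.

```python
from itertools import combinations


def avg_link(c1, c2, w):
    """Average edge weight between clusters (within-cluster average if c1 == c2)."""
    pairs = [(min(i, j), max(i, j)) for i in c1 for j in c2 if i != j]
    return sum(w.get(p, 0.0) for p in pairs) / len(pairs) if pairs else 0.0


def relational_cluster(n_papers, n_venues, venue_of, w_paper, w_venue, stop=0.0):
    papers = [frozenset([p]) for p in range(n_papers)]
    venues = [frozenset([v]) for v in range(n_venues)]

    def venue_cluster(paper_cluster):
        # Enhancement 1 guarantees all venues of a paper cluster share one cluster.
        for p in paper_cluster:
            if venue_of[p] is not None:
                return next(c for c in venues if venue_of[p] in c)
        return None

    def paper_score(a, b):
        va, vb = venue_cluster(a), venue_cluster(b)
        s = avg_link(a, b, w_paper)
        if va is None or vb is None:
            return s  # no venue information: fall back to the paper term
        # Enhancement 2: average the paper term and the venue term.
        return (s + avg_link(va, vb, w_venue)) / 2.0

    while True:
        cands = [(paper_score(a, b), "paper", a, b)
                 for a, b in combinations(papers, 2)]
        cands += [(avg_link(a, b, w_venue), "venue", a, b)
                  for a, b in combinations(venues, 2)]
        if not cands:
            break
        score, kind, a, b = max(cands, key=lambda c: c[0])
        if score <= stop:
            break
        if kind == "paper":
            va, vb = venue_cluster(a), venue_cluster(b)
            papers = [c for c in papers if c not in (a, b)] + [a | b]
            # Enhancement 1: merging papers forces their venue clusters to merge.
            if va is not None and vb is not None and va != vb:
                venues = [c for c in venues if c not in (va, vb)] + [va | vb]
        else:
            venues = [c for c in venues if c not in (a, b)] + [a | b]
    return papers, venues


# Table 1, with venues 10, 20, 30, 40 indexed as 0..3 (weights are assumptions).
w_paper = {(0, 1): 2.0, (2, 3): -0.2,
           (0, 2): -3.0, (0, 3): -3.0, (1, 2): -3.0, (1, 3): -3.0}
w_venue = {(0, 2): 3.0, (1, 3): 2.0,
           (0, 1): -0.5, (0, 3): -0.5, (1, 2): -0.5, (2, 3): -0.5}
papers, venues = relational_cluster(4, 4, [0, 1, 2, 3], w_paper, w_venue)
```

In this toy run, papers 2 and 3 have a negative edge weight in isolation and venues 0 and 1 are dissimilar strings, yet both pairs end up merged: the venue pair through the hard constraint, and the paper pair through the venue term in the redefined cluster score.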

This new clustering algorithm provides benefits to both paper and venue
deduplication that would be unavailable in an independent clustering algorithm.
As illustrated in our motivating example in Section 3, it is often the case that
paper duplicates are not detected because they have venues with decidedly different
surface forms (e.g. "CIKM" and "Conference on Knowledge and Information
Management"). The second enhancement addresses this problem by using the
evidence from previous venue clusterings to inform paper deduplication.
Specifically, if "CIKM" has already been resolved with "Conference on Knowledge
and Information Management," then merging papers with venues "CIKM" and
"Conference on Knowledge and Information Management" will be encouraged,
since there will be a high similarity between their associated venue clusters.

Conversely, by the hard constraint introduced in the first enhancement,
difficult venue deduplication decisions are informed by confident paper
deduplication decisions, as was also illustrated in Section 3. In this way, deduplication
decisions for both record types simultaneously grow more accurate.
4.4 Parameter Estimation

Given a labeled corpus of fully clustered data, maximum likelihood parameter
estimation corresponds to finding the parameters $\Lambda$ which maximize the
log-likelihood of the labeled training data. Exact estimation is intractable here
because it requires calculating the normalization term $Z_x$, a sum over all possible
values of $y$, which is a sum over all possible partitionings of the data. Due to
the nature of the data and the high connectivity of the graph, this cannot be
efficiently computed with a dynamic program.

One could perform stochastic gradient ascent on an approximation of the
likelihood. However, it has been noted that maximizing a product of local
marginals performs at least as well as this approximation on a similar
coreference task, if not better [24, 12]. Whereas in [24] the local marginals are over
single coreference decisions, here we maximize a product of joint conditional
probabilities for decisions $y^a_{ij}$ and $y^b_{ij}$, which we define as

$$ p_\Lambda(y^a_{ij}, y^b_{ij} \mid x^a, x^b, r) = \frac{1}{Z'_x} \exp\Big( \sum_l \lambda_l f_l(x^{ab}_{ij}, y^a_{ij}, y^b_{ij}, r^{ab}_{ij}) \Big) $$

where $Z'_x$ is a normalization constant summing over possible values for the pair
$(y^a_{ij}, y^b_{ij})$.

For a labeled dataset $\mathcal{D}$, the log-likelihood is defined as

$$ \mathcal{L}_\Lambda(\mathcal{D}) = \log \prod_{(x,y,r) \in \mathcal{D}} \prod_{i,j} p_\Lambda(y^a_{ij}, y^b_{ij} \mid x^a, x^b, r) $$


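We maximize $\mathcal{L}_\Lambda$ by gradient ascent, using its partial derivative with respect
to each weight $\lambda_l$:

$$ \frac{\partial \mathcal{L}}{\partial \lambda_l} = \sum_{(x,y,r) \in \mathcal{D}} \sum_{i,j} \Big( f_l(x^{ab}_{ij}, y^a_{ij}, y^b_{ij}, r^{ab}_{ij}) - \sum_{y'^a_{ij},\, y'^b_{ij}} p_\Lambda(y'^a_{ij}, y'^b_{ij} \mid x^a, x^b, r) \, f_l(x^{ab}_{ij}, y'^a_{ij}, y'^b_{ij}, r^{ab}_{ij}) \Big) $$

Because the log-likelihood defined above is concave (equivalently, its negation
is convex), we can perform gradient ascent using any suitable optimization
algorithm. In particular, we use limited-memory BFGS, which iteratively
approximates second-order curvature information to speed up convergence [19].

The estimation method can also be viewed as learning a distance metric
between paper-venue pairs. As explained in Section 4.3, this metric is used
to weight the edges in the deduplication graph, which is then partitioned at
inference time.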
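
The local objective and its gradient can be sketched as below. This is an illustration under assumptions rather than the training code used in the experiments: the feature function, the three labeled 4-tuples, and the step size are hypothetical, and plain fixed-step gradient ascent is substituted for limited-memory BFGS to keep the example dependency-free.

```python
import math
from itertools import product


def joint_marginal(features, weights, x_ab, r_ab):
    """p(y_a, y_b | x_ab, r_ab) over the four joint assignments."""
    scores = {
        (ya, yb): math.exp(
            sum(w * f for w, f in zip(weights, features(x_ab, ya, yb, r_ab))))
        for ya, yb in product((0, 1), repeat=2)
    }
    z = sum(scores.values())  # local normalizer Z'_x
    return {y: s / z for y, s in scores.items()}


def log_likelihood_and_gradient(data, features, weights):
    """data: list of (x_ab, r_ab, (y_a, y_b)) labeled 4-tuples."""
    ll, grad = 0.0, [0.0] * len(weights)
    for x_ab, r_ab, (ya, yb) in data:
        p = joint_marginal(features, weights, x_ab, r_ab)
        ll += math.log(p[(ya, yb)])
        observed = features(x_ab, ya, yb, r_ab)
        for l in range(len(weights)):
            # Expected feature value under the current local distribution.
            expected = sum(p[y] * features(x_ab, y[0], y[1], r_ab)[l] for y in p)
            grad[l] += observed[l] - expected
    return ll, grad


def fit(data, features, n_weights, step=0.1, iterations=200):
    """Fixed-step gradient ascent (a stand-in for L-BFGS)."""
    weights = [0.0] * n_weights
    for _ in range(iterations):
        _, grad = log_likelihood_and_gradient(data, features, weights)
        weights = [w + step * g for w, g in zip(weights, grad)]
    return weights


def pair_features(x_ab, y_a, y_b, r_ab):
    """Hypothetical features; x_ab is (title_similarity, venue_similarity)."""
    title_sim, venue_sim = x_ab
    return [y_a * (title_sim - 0.5), y_b * (venue_sim - 0.5), y_a * y_b * r_ab]


data = [((0.9, 0.8), 1, (1, 1)),   # duplicate papers, duplicate venues
        ((0.2, 0.9), 1, (0, 1)),   # distinct papers at the same venue
        ((0.1, 0.2), 1, (0, 0))]   # distinct papers, distinct venues
learned = fit(data, pair_features, 3)
```

The gradient is the familiar difference between observed and expected feature values, computed here over only four joint assignments per record pair, which is what makes the local approximation cheap.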
5 Experiments

We evaluate our model on two datasets of research paper citations. The first
is from Citeseer [10], containing approximately 1500 citations, with 900 unique
papers and 350 unique venues. The second is from the Cora Computer Science
Research Paper Engine^1, containing about 1800 citations, with 600 unique
papers and 200 unique venues.

Both datasets are manually labeled for both paper coreference and venue
coreference, as well as manually segmented into fields, such as author, title, etc.
The data were collected by searching for certain authors and topics, and they are
split into subsets with non-overlapping papers for the sake of cross-validation
experiments.

We used a number of feature functions, including exact and approximate
string match^2 on normalized and unnormalized values for the following citation
fields:  title,  booktitle,  journal,  authors,  venue,  date,  editors,  institution,  and  the 
entire  unsegmented  citation  string.  We  also  calculated  an  unweighted  cosine 
similarity  between  tokens  in  the  title  and  author  fields.  Additional  features 
include  whether  or  not  the  papers  have  the  same  publication  type  (e.g.  journal 
or  conference),  as  well  as  the  numerical  distance  between  fields  such  as  year  and 
volume.  All  real  values  were  binned  and  converted  into  binary-valued  features. 

To evaluate performance, we compare the clusters output by our system with
the true clustering using pairwise metrics. Pairwise precision is the fraction of
pairs placed in the same cluster that are truly coreferent; pairwise recall is the
fraction of coreferent pairs that were placed in the same cluster. Pairwise F1 is
the harmonic mean of pairwise precision and pairwise recall.
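The pairwise metrics defined above can be sketched as follows. The function names and the convention of returning 1.0 when there are no pairs to score are assumptions for this example, not details taken from the paper.

```python
from itertools import combinations


def same_cluster_pairs(clusters):
    """Return the set of unordered record pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pairwise_prf(predicted, truth):
    """Pairwise precision, recall and F1 of a predicted clustering."""
    pred = same_cluster_pairs(predicted)
    gold = same_cluster_pairs(truth)
    correct = len(pred & gold)
    # Assumed convention: with no pairs to score, the metric is 1.0.
    precision = correct / len(pred) if pred else 1.0
    recall = correct / len(gold) if gold else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Records 1-4: the system merges {1, 2, 3}; the truth is {1, 2} and {3, 4}.
precision, recall, f1 = pairwise_prf([{1, 2, 3}, {4}], [{1, 2}, {3, 4}])
```

In this example the system proposes three pairs, one of which is correct, and the truth contains two pairs, so precision is 1/3, recall is 1/2, and F1 is 0.4.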

Tables 2 and 3 show the F1 performance of two systems: joint is the system
we have advocated in this paper, and indep is the system which deduplicates
records of different types independently. Note that indep corresponds
to model (b) in Figure 1, so its deduplication decisions are still made
collectively for records of the same type, as in McCallum and Wellner [13].
Since the McCallum and Wellner model has been shown to consistently outperform
the classical transitive closure model, we do not compare with the classical
model here.

Results are listed by the name of each test subset; the remaining subsets are
used for training.

¹ http://www.cs.umass.edu/~mccallum/data/cora-refs.tar.gz

² We used the Secondstring package, found at http://secondstring.sourceforge.net

              Paper            Venue
              indep   joint    indep   joint
constraint    88.9    91.0     79.4    94.1
reinforce     92.2    92.2     56.5    60.1
face          88.2    93.7     80.9    82.8
reason        97.4    97.0     75.6    79.5
Micro Avg.    91.7    93.4     73.1    79.1

Table 2: Pairwise F1 deduplication performance on Citeseer data.

              Paper            Venue
              indep   joint    indep   joint
kibl          92.9    93.3     93.6    99.3
fahl          95.5    95.0     87.3    99.7
utgo          79.9    84.0     51.7    60.4
Micro Avg.    89.4    90.8     77.5    84.5

Table 3: Pairwise F1 deduplication performance on Cora data.

Venue performance improves considerably in the joint model, which is plausible
considering the strong influence paper deduplication has on venue deduplication.
Because paper deduplication often has more evidence at its disposal
than does venue deduplication, the joint model dramatically enhances venue
recall, obtaining a 5% absolute recall boost in Citeseer, and a 9% boost in
Cora data. This is especially noticeable when paper deduplication performance
is high: the hard constraint requiring the venues of duplicate papers to be
merged often merges venues that otherwise would have seemed too dissimilar
to merge on their own. Indeed, error analysis confirms that many of the venue
deduplication errors our model avoids are those where venues are dissimilar in
form, but are related to papers that are similar in form.
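To make the effect of the hard constraint concrete, the toy sketch below propagates a given set of paper merges to the venues of those papers using union-find. It illustrates only this one-directional constraint and is not the relational partitioning algorithm used in the paper; the class, the function name, and the example records are assumptions.

```python
class UnionFind:
    """Minimal union-find over hashable items."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def propagate_paper_merges(paper_clusters, venue_of):
    """Merge the venues of all papers that were judged to be duplicates.

    paper_clusters: list of lists of paper ids (assumed already deduplicated).
    venue_of: dict mapping a paper id to its venue string.
    Returns the venue clusters as a sorted list of sorted lists.
    """
    venues = UnionFind()
    for venue in venue_of.values():
        venues.find(venue)  # register every venue, including singletons
    for cluster in paper_clusters:
        members = [venue_of[p] for p in cluster if p in venue_of]
        for venue in members[1:]:
            venues.union(members[0], venue)
    groups = {}
    for venue in list(venues.parent):
        groups.setdefault(venues.find(venue), set()).add(venue)
    return sorted(sorted(group) for group in groups.values())


venue_of = {"p1": "Proc. AAAI", "p2": "AAAI Conference", "p3": "ICML"}
venue_clusters = propagate_paper_merges([["p1", "p2"], ["p3"]], venue_of)
```

Here the two venue strings share only one token, yet they end up in one cluster because their papers were merged. The same mechanism would propagate an incorrect paper merge to the venues.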

More interestingly, a noticeable improvement in paper deduplication is attained
by the collective model. Part of this is due to the precision enhancement
provided by the constrained clustering algorithm. Workshop and technical report
versions of journal or conference papers with the same title are correctly not
merged when the venues are accurately identified. Also, error analysis suggests
that papers that would not have been otherwise merged were merged because
their venues were determined to be coreferent.

It is worth noting that many of the errors made by the joint model stem from
the same mechanisms that improve performance on average. For example,
if paper deduplication accuracy is poor, the relational clustering algorithm
can result in many venues being merged that would not have been otherwise.
Future work should investigate how to detect poor paper accuracy and adjust
accordingly.

5.1  Scalability 

While  the  datasets  used  in  our  experiments  are  of  reasonable  size,  we  would 
ultimately  like  to  apply  this  model  to  large  databases.  Here  we  briefly  discuss 
performance  issues  and  describe  methods  to  scale  our  model  to  real-world  data. 

Because parameter estimation maximizes the product of local node potentials,
it is likely to be much faster than a global approximate training method
such as loopy belief propagation. For the data used in these experiments,
training time averaged about 15 minutes on dual-processor, 3.06 GHz Xeon
machines with 4 GB of RAM. Inference time ranged from 20 minutes to about an
hour, depending on the size of the testing data.

To make inference scalable, a practical implementation would make use of
“canopies” [14]. This technique reduces the connectivity of the graph of records
by defining a cheap similarity metric between records (often using an inverted
index or tf-idf). When constructing the record graph, edges are only added
between those records that have similarity above some threshold, where similarity
is the output of the cheap metric. In this way, records that are very unlikely
to be duplicates are not considered by the model. This provides an efficient,
accurate way of pruning the search space.
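The edge-pruning step described above can be sketched as follows. This is a simplified illustration, assuming whitespace tokenization, Jaccard overlap as the cheap metric, and an arbitrary threshold of 0.3; it uses a single threshold on candidate edges and is not the full canopy method of [14].

```python
from collections import defaultdict
from itertools import combinations


def tokenize(text):
    """Lowercased whitespace tokens of a record string, as a set."""
    return set(text.lower().split())


def candidate_edges(records, threshold=0.3):
    """Return the record pairs whose cheap similarity reaches `threshold`.

    records: dict mapping a record id to its text.
    Only pairs that share at least one token are ever compared, because
    candidates are read off an inverted index from token to record ids.
    """
    tokens = {rid: tokenize(text) for rid, text in records.items()}
    index = defaultdict(set)
    for rid, toks in tokens.items():
        for tok in toks:
            index[tok].add(rid)

    edges, seen = set(), set()
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            if (a, b) in seen:
                continue  # pair already scored via another shared token
            seen.add((a, b))
            jaccard = len(tokens[a] & tokens[b]) / len(tokens[a] | tokens[b])
            if jaccard >= threshold:
                edges.add((a, b))
    return edges


records = {
    1: "Proc of AAAI 1998",
    2: "AAAI 1998 Proceedings",
    3: "Journal of Machine Learning",
}
edges = candidate_edges(records)
```

Only the surviving edges would then be scored by the expensive model. Records 1 and 3 share the token "of" but fall below the threshold, and records 2 and 3 are never compared at all.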

Besides  performance  issues,  there  is  reason  to  believe  that  the  advantages  of 
the  model  presented  in  this  paper  will  be  even  more  noticeable  in  larger  data 
sets,  where  there  is  more  heterogeneity  in  field  values,  more  interesting  relational 
patterns,  and  larger  record  clusters.  In  fact,  the  data  used  in  experiments 
presented  here  contain  many  singleton  clusters,  which  is  not  truly  reflective  of 
the  large  clusters  of  records  found  in  real-world  data. 

6  Conclusions 

We  have  introduced  a  collective  model  for  deduplication  of  related  records  of 
multiple  types  and  demonstrated  empirically  the  advantages  it  has  over  methods 
that  do  not  address  the  interdependencies  inherent  in  relational  data. 

Based on these results, two promising areas of future research are (1) extending
the model to databases with more than two types of records, and (2)
modeling the relation variables R. In addition to paper and venue deduplication,
author deduplication is also a difficult problem that would likely benefit
from this approach, and we are in the process of harvesting data to allow us to
model author, venue, and paper deduplication jointly.

In the publications domain, the connections between author, venue, and paper
deduplication become more interesting with larger databases, where communities
and relations become more visible. In particular, exciting challenges
include building a model to predict advisor_of relations between authors, suggest
possible venues for a paper, identify fruitful author collaborations, match recent
graduates with potential research labs, and discover the dynamics of research
communities.

The  challenge  as  usual  will  lie  in  developing  a  model  that  is  complex  enough 
to  model  these  long-distance  relations,  but  is  still  tractable  enough  to  perform 
on  real  data.  We  feel  that  the  model  proposed  here  is  a  productive  step  in  that 
long-term  direction. 